Project Overview

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.

The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you have amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Review criteria

  1. Does the link lead to an HTML page describing the exploratory analysis of the training data set?
  2. Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
  3. Has the data scientist made basic plots, such as histograms to illustrate features of the data?
  4. Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
library(stringr)    # str_split
library(quanteda)   # corpus, tokens, dfm and related helpers
library(readtext)   # readtext
library(R.utils)    # countLines
library(ggplot2)    # frequency plots

set.seed(3301)

Task 0: Understanding the problem

Tasks to accomplish

  1. Obtaining the data - Can you download the data and load/manipulate it in R?
  2. Familiarizing yourself with NLP and text mining - Learn about the basics of natural language processing and how it relates to the data science process you have learned in the Data Science Specialization.

Questions to consider

  1. What do the data look like?
  2. Where do the data come from?
  3. Can you think of any other data sources that might help you in this project?
  4. What are the common steps in natural language processing?
  5. What are some common issues in the analysis of text data?
  6. What is the relationship between NLP and the concepts you have learned in the Specialization?

Download data.

source("downloadData.R")

attach(downloadData(file.path("..", "data")))
c(blogs, twitter, news, badwords)
## [1] "../data/final/en_US/en_US.blogs.txt"  
## [2] "../data/final/en_US/en_US.twitter.txt"
## [3] "../data/final/en_US/en_US.news.txt"   
## [4] "../data/bad-words.txt"

First, load the Twitter file line by line using a naive hand-written implementation.

tweets <- 0
wordsTwitter <- 0
sentencesTwitter <- 0
con <- file(twitter, "r")
# Read one line at a time until the end of the file
while (length(oneLine <- readLines(con, 1, warn = FALSE)) > 0) {
        tweets <- tweets + 1
        # Print the first 10 tweets as a sanity check
        if(tweets <= 10) {
                print(oneLine)
        }
        words <- str_split(oneLine, "\\s+")[[1]]
        # Flag tokens that contain no letters or digits (pure symbols)
        symbols <- grepl("^[^a-zA-Z0-9]+$", words)
        # Collapse purely numeric tokens into a single placeholder
        words[grepl("^[0-9]+$", words)] <- "[numbers]"
        simpleWords <- words[!symbols]
        # Count tokens ending in ., ! or ? as sentence ends
        sentencesTwitter <- sentencesTwitter + sum(grepl("[.!?]$", simpleWords))
        wordsTwitter <- wordsTwitter + length(simpleWords)
}
close(con)

tweets
wordsTwitter
sentencesTwitter
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [1] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [1] "they've decided its more fun if I don't."
## [1] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [1] "Words from a complete stranger! Made my birthday even better :)"
## [1] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
## [1] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"
## [1] "I'm coo... Jus at work hella tired r u ever in cali"
## [1] "The new sundrop commercial ...hehe love at first sight"
## [1] "we need to reconnect THIS WEEK"

## [1] 2360148
## [1] 29706404
## [1] 2818583

Next, load the same file using the readtext package.

tweetFile <- readtext(twitter)
corpusTwitter <- corpus(tweetFile, cache = FALSE)
summary(corpusTwitter)
## Corpus consisting of 1 document:
## 
##               Text  Types   Tokens Sentences
##  en_US.twitter.txt 566951 36719658   2588551
## 
## Source: /Users/warhol/Documents/!work/Data-Science-Capstone/MilestoneReport/* on x86_64 by warhol
## Created: Sun Jul 29 03:34:49 2018
## Notes:

Task 1: Getting and cleaning the data

Tasks to accomplish

  1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.
  2. Profanity filtering - removing profanity and other words you do not want to predict.
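
The two tasks above can be illustrated with a minimal base-R sketch. It is not the approach used in the rest of this report (which relies on quanteda); the helper names `tokenizeFile` and `removeProfanity`, the stand-in file and the one-word block list are all assumptions made for illustration.

```r
# Hypothetical helper: read a text file and return lowercase word tokens.
tokenizeFile <- function(path) {
        lines <- tolower(readLines(path, warn = FALSE))
        # Keep runs of letters, digits and apostrophes; anything else separates tokens
        unlist(regmatches(lines, gregexpr("[a-z0-9']+", lines)))
}

# Hypothetical helper: drop every token that appears in a block list.
removeProfanity <- function(tokens, blockList) {
        tokens[!tokens %in% tolower(blockList)]
}

# Stand-in file so the sketch is self-contained
sampleFile <- tempfile(fileext = ".txt")
writeLines(c("First Cubs game ever!", "That darn thing wakes me up"), sampleFile)

tokens <- tokenizeFile(sampleFile)
cleanTokens <- removeProfanity(tokens, blockList = c("darn"))
print(cleanTokens)
```

A real tokenizer would also need to decide how to treat numbers, hashtags, emoticons and URLs, which this sketch ignores.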

Tips, tricks, and hints

  1. Loading the data in. This dataset is fairly large. We emphasize that you don’t necessarily need to load the entire dataset in to build your algorithms (see point 2 below). At least initially, you might want to use a smaller subset of the data. Reading in chunks or lines using R’s readLines or scan functions can be useful. You can also loop over each line of text by embedding readLines within a for/while loop, but this may be slower than reading in large chunks at a time. Reading pieces of the file at a time will require the use of a file connection in R. For example, the following code could be used to read the first few lines of the English Twitter dataset:

        con <- file("en_US.twitter.txt", "r")
        readLines(con, 1) ## Read the first line of text
        readLines(con, 1) ## Read the next line of text
        readLines(con, 5) ## Read in the next 5 lines of text
        close(con) ## It's important to close the connection when you are done

     See the ?connections help page for more information.
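
As a rough sketch of the chunked reading described in this tip, the code below counts the lines of a file by pulling a fixed number of lines per call from an open connection. The function name `countLinesInChunks`, the chunk size and the temporary stand-in file are assumptions for illustration only.

```r
# Stand-in file with 25 lines so the sketch runs without the real dataset
sampleFile <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:25), sampleFile)

# Hypothetical helper: count lines by reading chunkSize lines at a time.
countLinesInChunks <- function(path, chunkSize = 10) {
        con <- file(path, "r")
        on.exit(close(con))   # close the connection even if an error occurs
        total <- 0
        repeat {
                chunk <- readLines(con, n = chunkSize, warn = FALSE)
                if (length(chunk) == 0) break   # end of file reached
                total <- total + length(chunk)
        }
        total
}

totalLines <- countLinesInChunks(sampleFile, chunkSize = 10)
print(totalLines)
```

Because the connection stays open between calls, each `readLines` call continues where the previous one stopped, so only one chunk is held in memory at a time.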

  2. Sampling. To reiterate, to build models you don’t need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. Remember your inference class and how a representative sample can be used to infer facts about a population. You might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, you can store the sample and not have to recreate it every time. You can use the rbinom function to “flip a biased coin” to determine whether you sample a line of text or not.

Sub-Sampling.

twitterSubSampling <- paste0(twitter, ".sub-sampling.txt")
if(!file.exists(twitterSubSampling)) {
        subSamplingSize <- 10000
        flipABiasedCoin <- rbinom(tweets, size = 1, prob = subSamplingSize / tweets)
        conRead <- file(twitter, "r")
        conWrite <- file(twitterSubSampling, "w")
        len <- 0
        while (length(oneLine <- readLines(conRead, 1, warn = FALSE)) > 0) {
                len <- len + 1
                if(flipABiasedCoin[len] == 1) {
                        writeLines(oneLine, conWrite)
                }
        }
        close(conRead)
        close(conWrite)
}

subTweets <- as.numeric(countLines(twitterSubSampling))
subTweets
## [1] 9970
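
The sample contains 9970 lines rather than exactly 10000 because each line is kept independently with a fixed probability, so the sample size is itself a binomial random variable. The short calculation below, which uses only the line count and target size reported above, suggests that 9970 is well within the expected spread.

```r
totalLines <- 2360148    # number of tweets counted earlier in this report
targetSize <- 10000      # subSamplingSize used in the sampling code
observed <- 9970         # number of lines actually written to the sample file

p <- targetSize / totalLines                 # per-line keep probability
expectedSize <- totalLines * p               # mean of the binomial
sdSize <- sqrt(totalLines * p * (1 - p))     # standard deviation of the binomial
zScore <- (observed - expectedSize) / sdSize # distance from the mean in sd units

print(c(expected = expectedSize, sd = sdSize, z = zScore))
```

The standard deviation is roughly 100 lines, so a shortfall of 30 lines is about 0.3 standard deviations from the target.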

Tokenization.

subTweetFile <- readtext(twitterSubSampling)
subTwitterCorpus <- corpus(subTweetFile, cache = FALSE)
summary(subTwitterCorpus)
## Corpus consisting of 1 document:
## 
##                                Text Types Tokens Sentences
##  en_US.twitter.txt.sub-sampling.txt 20110 154790     10982
## 
## Source: /Users/warhol/Documents/!work/Data-Science-Capstone/MilestoneReport/* on x86_64 by warhol
## Created: Sun Jul 29 03:36:28 2018
## Notes:

Load bad words.

profanity <- readLines(badwords)

Task 2: Exploratory Data Analysis

Tasks to accomplish

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Questions to consider

  1. Some words are more frequent than others - what are the distributions of word frequencies?
  2. What are the frequencies of 2-grams and 3-grams in the dataset?
  3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
  4. How do you evaluate how many of the words come from foreign languages?
  5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

Example of n-gram sequences:

  Field: Computational linguistics
  Unit: word
  Sample sequence: … to be or not to be …
  1-gram sequence: …, to, be, or, not, to, be, …
  2-gram sequence: …, to be, be or, or not, not to, to be, …
  3-gram sequence: …, to be or, be or not, or not to, not to be, …
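
The n-gram sequences in this example can be reproduced with a few lines of base R. This is only a sketch to show the sliding-window idea; the helper name `makeNgrams` is an assumption, and the report itself uses quanteda's `tokens_ngrams` for the real data.

```r
# Hypothetical helper: slide a window of n words over a word vector.
makeNgrams <- function(words, n) {
        if (length(words) < n) return(character(0))
        starts <- seq_len(length(words) - n + 1)
        vapply(starts,
               function(i) paste(words[i:(i + n - 1)], collapse = " "),
               character(1))
}

words <- c("to", "be", "or", "not", "to", "be")
bigrams <- makeNgrams(words, 2)
trigrams <- makeNgrams(words, 3)
print(bigrams)
print(trigrams)
```

A sequence of m words yields m - n + 1 n-grams, which is why six words give five 2-grams and four 3-grams.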

Top 20.

subTweetsDfm <- dfm(subTwitterCorpus)
topfeatures(subTweetsDfm, 20)
##     .     !   the    to     ,     i     a   you   and     ?     :   for 
## 10753  5448  3830  3312  3172  3063  2657  2265  1809  1787  1695  1660 
##    in    of    is     "    my    it    on  that 
##  1596  1531  1493  1309  1307  1264  1218   927

Plot word cloud.

subTweetsDfm %>% 
        dfm_trim(min_termfreq = 10,
                 verbose = FALSE) %>%
        textplot_wordcloud(min_count = 6,
                           random_order = FALSE, 
                           rotation = .25,
                           color = RColorBrewer::brewer.pal(8, "Dark2"))

Normalize words.

subTweetsDfmNomarized <- subTwitterCorpus %>% 
        # normalize words
        tokens(remove_punct = TRUE,
               remove_numbers = TRUE) %>%
        # removing profanity and other words
        tokens_remove(stopwords('english')) %>%
        tokens_remove(profanity)

Top 20 normalized words.

topfeatures(dfm(subTweetsDfmNomarized), 20)
##   just   like    get   love   good    day      u     rt    can thanks 
##    641    494    470    448    427    402    373    367    366    361 
##    now    one   time  great   know  today    new    lol     go    see 
##    342    330    328    311    308    300    272    270    268    261

Plot word cloud.

dfm(subTweetsDfmNomarized) %>%
        dfm_trim(min_termfreq = 10,
                 verbose = FALSE) %>%
        textplot_wordcloud(min_count = 6,
                           random_order = FALSE,
                           max_words = 100,
                           rotation = .25,
                           color = RColorBrewer::brewer.pal(8, "Dark2"))

Frequency Plots

featuresTweetsDfm <- textstat_frequency(dfm(subTweetsDfmNomarized), n = 80)

# Sort by reverse frequency order
featuresTweetsDfm$feature <- with(featuresTweetsDfm, reorder(feature, -frequency))

ggplot(featuresTweetsDfm, aes(x = feature, y = frequency)) +
        geom_point() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

2-Gram

subTweetsDfmNomarized2Gram <- subTwitterCorpus %>% 
        # normalize words
        tokens(remove_punct = TRUE,
               remove_numbers = TRUE) %>%
        # removing profanity and other words
#        tokens_remove(stopwords('english')) %>%
        tokens_remove(profanity) %>%
        tokens_ngrams(n = 2)
topfeatures(dfm(subTweetsDfmNomarized2Gram), 20)
##     in_the    for_the     of_the     on_the      to_be   going_to 
##        319        290        243        218        187        165 
## thanks_for     to_the     i_love  thank_you     if_you      for_a 
##        164        158        156        155        155        144 
##     at_the     have_a       i_am       is_a     to_get    will_be 
##        141        137        128        126        117        116 
##      i_was     i_have 
##        111        111
dfm(subTweetsDfmNomarized2Gram) %>%
        dfm_trim(min_termfreq = 10,
                 verbose = FALSE) %>%
        textplot_wordcloud(min_count = 6,
                           random_order = FALSE,
                           max_words = 100,
                           rotation = .25,
                           color = RColorBrewer::brewer.pal(8, "Dark2"))

Frequency Plots

featuresTweetsDfm2Gram <- textstat_frequency(dfm(subTweetsDfmNomarized2Gram), n = 80)

# Sort by reverse frequency order
featuresTweetsDfm2Gram$feature <- with(featuresTweetsDfm2Gram, reorder(feature, -frequency))

ggplot(featuresTweetsDfm2Gram, aes(x = feature, y = frequency)) +
        geom_point() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

3-Gram

subTweetsDfmNomarized3Gram <- subTwitterCorpus %>% 
        # normalize words
        tokens(remove_punct = TRUE,
               remove_numbers = TRUE) %>%
        # removing profanity and other words
        tokens_remove(stopwords('english')) %>%
        tokens_remove(profanity) %>%
        tokens_ngrams(n = 3)
topfeatures(dfm(subTweetsDfmNomarized3Gram), 20)
##     happy_mother's_day         happy_new_year      happy_mothers_day 
##                     10                     10                      8 
##               la_la_la looking_forward_seeing            let_us_know 
##                      8                      7                      6 
##     please_follow_back         good_right_now          cinco_de_mayo 
##                      4                      4                      4 
##   please_please_please       feel_better_soon         today_good_day 
##                      4                      3                      3 
##          just_got_back  merry_christmas_happy           come_join_us 
##                      3                      3                      3 
##     happy_st_patrick's       st_patrick's_day         cant_wait_hear 
##                      3                      3                      3 
##          run_time_nike          time_nike_gps 
##                      3                      3
dfm(subTweetsDfmNomarized3Gram) %>%
        textplot_wordcloud(min_count = 4,
                           random_order = FALSE,
                           max_words = 50,
                           rotation = .25,
                           color = RColorBrewer::brewer.pal(8, "Dark2"))

Frequency Plots

featuresTweetsDfm3Gram <- textstat_frequency(dfm(subTweetsDfmNomarized3Gram), n = 60)

# Sort by reverse frequency order
featuresTweetsDfm3Gram$feature <- with(featuresTweetsDfm3Gram, reorder(feature, -frequency))

ggplot(featuresTweetsDfm3Gram, aes(x = feature, y = frequency)) +
        geom_point() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

1-gram 90th percentile:

featuresTweetsDfmFull <- textstat_frequency(dfm(subTweetsDfmNomarized))
summary(featuresTweetsDfmFull)
##    feature            frequency            rank          docfreq 
##  Length:15408       Min.   :  1.000   Min.   :    1   Min.   :1  
##  Class :character   1st Qu.:  1.000   1st Qu.: 3853   1st Qu.:1  
##  Mode  :character   Median :  1.000   Median : 7704   Median :1  
##                     Mean   :  4.481   Mean   : 7704   Mean   :1  
##                     3rd Qu.:  2.000   3rd Qu.:11556   3rd Qu.:1  
##                     Max.   :641.000   Max.   :15408   Max.   :1  
##     group          
##  Length:15408      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
quantile(featuresTweetsDfmFull$frequency, c(0, .1, .5, .9, 1))
##   0%  10%  50%  90% 100% 
##    1    1    1    7  641

2-gram 90th percentile:

featuresTweetsDfm2GramFull <- textstat_frequency(dfm(subTweetsDfmNomarized2Gram))
summary(featuresTweetsDfm2GramFull)
##    feature            frequency            rank          docfreq 
##  Length:77857       Min.   :  1.000   Min.   :    1   Min.   :1  
##  Class :character   1st Qu.:  1.000   1st Qu.:19465   1st Qu.:1  
##  Mode  :character   Median :  1.000   Median :38929   Median :1  
##                     Mean   :  1.576   Mean   :38929   Mean   :1  
##                     3rd Qu.:  1.000   3rd Qu.:58393   3rd Qu.:1  
##                     Max.   :319.000   Max.   :77857   Max.   :1  
##     group          
##  Length:77857      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
quantile(featuresTweetsDfm2GramFull$frequency, c(0, .1, .5, .9, 1))
##   0%  10%  50%  90% 100% 
##    1    1    1    2  319

3-gram 90th percentile:

featuresTweetsDfm3GramFull <- textstat_frequency(dfm(subTweetsDfmNomarized3Gram))
summary(featuresTweetsDfm3GramFull)
##    feature            frequency           rank          docfreq 
##  Length:68757       Min.   : 1.000   Min.   :    1   Min.   :1  
##  Class :character   1st Qu.: 1.000   1st Qu.:17190   1st Qu.:1  
##  Mode  :character   Median : 1.000   Median :34379   Median :1  
##                     Mean   : 1.004   Mean   :34379   Mean   :1  
##                     3rd Qu.: 1.000   3rd Qu.:51568   3rd Qu.:1  
##                     Max.   :10.000   Max.   :68757   Max.   :1  
##     group          
##  Length:68757      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
quantile(featuresTweetsDfm3GramFull$frequency, c(0, .1, .5, .9, 1))
##   0%  10%  50%  90% 100% 
##    1    1    1    1   10
ntoken(subTweetsDfmNomarized)
## en_US.twitter.txt.sub-sampling.txt 
##                              69046
ntype(subTweetsDfmNomarized)
## en_US.twitter.txt.sub-sampling.txt 
##                              18775
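
The quantiles above describe how often a typical word type occurs. The coverage question from Task 2 (how many unique words cover 50% or 90% of all word instances) needs a cumulative sum over the frequency-sorted dictionary instead. The sketch below shows that calculation on made-up counts; the helper name `wordsNeeded` and the toy frequencies are assumptions, not results from the Twitter sample.

```r
# Hypothetical helper: number of most-frequent words needed to reach a coverage share.
wordsNeeded <- function(frequencies, coverage) {
        sorted <- sort(frequencies, decreasing = TRUE)
        cumulativeShare <- cumsum(sorted) / sum(sorted)
        # Position of the first word at which the cumulative share reaches the target
        unname(which(cumulativeShare >= coverage)[1])
}

# Made-up word counts that sum to 100
toyFrequencies <- c(the = 45, to = 20, you = 12, love = 10, gym = 8, store = 5)

half <- wordsNeeded(toyFrequencies, 0.5)
ninety <- wordsNeeded(toyFrequencies, 0.9)
print(c(half = half, ninety = ninety))
```

The same function could presumably be applied to `featuresTweetsDfmFull$frequency`, although that has not been run here.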

Task 3: Modeling

Tasks to accomplish

  1. Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
  2. Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.

Questions to consider

  1. How can you efficiently store an n-gram model (think Markov Chains)?
  2. How can you use the knowledge about word frequencies to make your model smaller and more efficient?
  3. How many parameters do you need (i.e. how big is n in your n-gram model)?
  4. Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data) ?
  5. How do you evaluate whether your model is any good?
  6. How can you use backoff models to estimate the probability of unobserved n-grams?
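
One simple answer to the smoothing question is add-one (Laplace) smoothing: add 1 to every 2-gram count, seen or not, before normalizing. The sketch below is an illustration under stated assumptions, not the model built in this report; the helper name `smoothedProbability`, the toy counts and the four-word vocabulary are invented.

```r
# Toy 2-gram counts keyed as "context_word" (made-up numbers)
bigramCounts <- c(to_be = 3, to_get = 1)
vocabulary <- c("be", "get", "the", "gym")

# Hypothetical helper: add-one smoothed estimate of P(word | context).
smoothedProbability <- function(context, word, bigramCounts, vocabulary) {
        key <- paste(context, word, sep = "_")
        observed <- if (key %in% names(bigramCounts)) bigramCounts[[key]] else 0
        # Total count of all 2-grams that start with the context word
        inContext <- startsWith(names(bigramCounts), paste0(context, "_"))
        contextTotal <- sum(bigramCounts[inContext])
        (observed + 1) / (contextTotal + length(vocabulary))
}

pSeen <- smoothedProbability("to", "be", bigramCounts, vocabulary)
pUnseen <- smoothedProbability("to", "gym", bigramCounts, vocabulary)
print(c(seen = pSeen, unseen = pUnseen))
```

With these counts the seen pair gets (3 + 1) / (4 + 4) and the unseen pair gets (0 + 1) / (4 + 4), so every word in the vocabulary keeps a non-zero probability and the probabilities still sum to 1.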

Basic 2-gram model:

nextWords2Gram <- function(input) {
        featuresNextWord <- NULL

        nextWordDfm <- dfm(
                tokens_select(
                        subTweetsDfmNomarized2Gram, 
                        paste0("^", input, "_.*"),
                        valuetype ="regex"))
        
        if(length(nextWordDfm) > 0) {
                featuresNextWord <- textstat_frequency(nextWordDfm, n = 5)
                featuresNextWord$feature <- 
                        sapply(as.vector(featuresNextWord$feature), 
                               function(x){
                                       str_split(x, "_")[[1]][2]
                               })
                # Sort by reverse frequency order
                featuresNextWord$feature <- with(featuresNextWord, reorder(feature, -frequency))
        
        } else {
                # Unseen n-gram: no 2-gram starts with the input word, so NULL is returned.
                # TODO: handle this case with a backoff model.
        }

        featuresNextWord
}
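
The empty branch for unseen n-grams could be filled in many ways. As a minimal sketch, the code below falls back to the most frequent single words when no 2-gram starts with the input word. It works on plain named count vectors rather than quanteda objects, and the helper name `predictWithBackoff` and all counts are assumptions for illustration. It is a simple fallback, not a full probabilistic backoff model.

```r
# Hypothetical helper: suggest up to k next words, backing off to 1-gram counts.
predictWithBackoff <- function(context, bigramCounts, unigramCounts, k = 3) {
        prefix <- paste0(context, "_")
        matches <- bigramCounts[startsWith(names(bigramCounts), prefix)]
        if (length(matches) > 0) {
                # Seen context: rank the observed continuations by count
                ranked <- sort(matches, decreasing = TRUE)
                candidates <- substring(names(ranked), nchar(prefix) + 1)
        } else {
                # Unseen context: back off to the most frequent single words
                candidates <- names(sort(unigramCounts, decreasing = TRUE))
        }
        head(candidates, k)
}

# Made-up counts keyed the same way as the 2-gram features above
bigramCounts <- c(looking_forward = 7, looking_for = 4, went_to = 5)
unigramCounts <- c(the = 30, to = 25, you = 20, love = 10)

seen <- predictWithBackoff("looking", bigramCounts, unigramCounts)
unseen <- predictWithBackoff("zebra", bigramCounts, unigramCounts)
print(seen)
print(unseen)
```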

Most frequent next words after "Looking":

ggplot(nextWords2Gram("Looking"), aes(x = feature, y = frequency)) +
        geom_bar(stat = "identity") + 
        xlab("Next word")

Most frequent next words after "forward":

ggplot(nextWords2Gram("forward"), aes(x = feature, y = frequency)) +
        geom_bar(stat = "identity") + 
        xlab("Next word")

I went to be

ggplot(nextWords2Gram("went"), aes(x = feature, y = frequency)) +
        geom_bar(stat = "identity") + 
        xlab("Next word")

ggplot(nextWords2Gram("to"), aes(x = feature, y = frequency)) +
        geom_bar(stat = "identity") + 
        xlab("Next word")

ggplot(nextWords2Gram("be"), aes(x = feature, y = frequency)) +
        geom_bar(stat = "identity") + 
        xlab("Next word")

ggplot(nextWords2Gram("a"), aes(x = feature, y = frequency)) +
        geom_bar(stat = "identity") + 
        xlab("Next word")

ggplot(nextWords2Gram("great"), aes(x = feature, y = frequency)) +
        geom_bar(stat = "identity") + 
        xlab("Next word")

References